
Designing value-aligned autonomous vehicles: from moral dilemmas to conflict-sensitive design

AIHub

Imagine an autonomous car driving along a quiet suburban road when suddenly a dog runs onto the road. The system must brake hard and decide, within a fraction of a second, whether to swerve into oncoming traffic (where the oncoming car, itself autonomous, might make space), steer right and hit the roadside barrier, or continue straight and injure the dog. The first two options risk only material damage; the third harms a living creature. Each choice is defensible, and each involves trade-offs between safety, property and ethical concerns. Yet today's autonomous systems are not designed to take such value-laden conflicts explicitly into account.


Accumulating Context Changes the Beliefs of Language Models

Geng, Jiayi, Chen, Howard, Liu, Ryan, Ribeiro, Manoel Horta, Willer, Robb, Neubig, Graham, Griffiths, Thomas L.

arXiv.org Artificial Intelligence

Language model (LM) assistants are increasingly used in applications such as brainstorming and research. Improvements in memory and context size have allowed these models to become more autonomous, which has also resulted in more text accumulation in their context windows without explicit user intervention. This comes with a latent risk: the belief profiles of models -- their understanding of the world as manifested in their responses or actions -- may silently change as context accumulates. This can lead to subtly inconsistent user experiences, or shifts in behavior that deviate from the original alignment of the models. In this paper, we explore how accumulating context by engaging in interactions and processing text -- talking and reading -- can change the beliefs of language models, as manifested in their responses and behaviors. Our results reveal that models' belief profiles are highly malleable: GPT-5 exhibits a 54.7% shift in its stated beliefs after 10 rounds of discussion about moral dilemmas and queries about safety, while Grok 4 shows a 27.2% shift on political issues after reading texts from the opposing position. We also examine models' behavioral changes by designing tasks that require tool use, where each tool selection corresponds to an implicit belief. We find that these changes align with stated belief shifts, suggesting that belief shifts will be reflected in actual behavior in agentic systems. Our analysis exposes the hidden risk of belief shift as models undergo extended sessions of talking or reading, rendering their opinions and actions unreliable.
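A belief shift of the kind reported above can be quantified as the fraction of survey items on which a model's stated position changes between the start and end of a session. The following is a minimal sketch, not the paper's actual protocol; the item names and positions are invented for illustration:

```python
def belief_shift(before: dict, after: dict) -> float:
    """Fraction of shared survey items whose stated position changed."""
    items = before.keys() & after.keys()
    if not items:
        return 0.0
    changed = sum(1 for k in items if before[k] != after[k])
    return changed / len(items)

# Hypothetical stated positions before and after 10 rounds of discussion
before = {"q1": "agree", "q2": "disagree", "q3": "neutral", "q4": "agree"}
after = {"q1": "disagree", "q2": "disagree", "q3": "agree", "q4": "agree"}
print(belief_shift(before, after))  # 0.5
```

Comparing the same metric for stated beliefs and for tool selections would give a rough check of whether stated shifts carry over into behavior, as the paper investigates.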


The Pluralistic Moral Gap: Understanding Judgment and Value Differences between Humans and Large Language Models

Russo, Giuseppe, Nozza, Debora, Röttger, Paul, Hovy, Dirk

arXiv.org Artificial Intelligence

People increasingly rely on Large Language Models (LLMs) for moral advice, which may influence humans' decisions. Yet, little is known about how closely LLMs align with human moral judgments. To address this, we introduce the Moral Dilemma Dataset, a benchmark of 1,618 real-world moral dilemmas paired with a distribution of human moral judgments consisting of a binary evaluation and a free-text rationale. We treat this problem as a pluralistic distributional alignment task, comparing the distributions of LLM and human judgments across dilemmas. We find that models reproduce human judgments only under high consensus; alignment deteriorates sharply when human disagreement increases. In parallel, using a 60-value taxonomy built from 3,783 value expressions extracted from rationales, we show that LLMs rely on a narrower set of moral values than humans. These findings reveal a pluralistic moral gap: a mismatch in both the distribution and diversity of values expressed. To close this gap, we introduce Dynamic Moral Profiling (DMP), a Dirichlet-based sampling method that conditions model outputs on human-derived value profiles. DMP improves alignment by 64.3% and enhances value diversity, offering a step toward more pluralistic and human-aligned moral guidance from LLMs.
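The Dirichlet-based sampling at the core of DMP can be illustrated in miniature: a human-derived value profile supplies the concentration parameters, and each draw yields a mixture over values on which a model's output could be conditioned. This is a simplified sketch, not the paper's implementation; the value names and weights are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical human-derived value profile for one dilemma: the relative
# weight of each value in human rationales (names and numbers illustrative).
value_profile = {"care": 5.0, "fairness": 3.0, "honesty": 1.5, "loyalty": 0.5}

# Draw a per-response mixture over values from a Dirichlet distribution
# whose concentration parameters are the profile weights.
alphas = np.array(list(value_profile.values()))
mixture = rng.dirichlet(alphas)

for value, weight in zip(value_profile, mixture):
    print(f"{value}: {weight:.3f}")
```

Sampling a fresh mixture per response, rather than always using the mean profile, is what would let conditioned outputs reflect the diversity of human value expression rather than a single averaged stance.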


Moral Responsibility or Obedience: What Do We Want from AI?

Boland, Joseph

arXiv.org Artificial Intelligence

As artificial intelligence systems become increasingly agentic, capable of general reasoning, planning, and value prioritization, current safety practices that treat obedience as a proxy for ethical behavior are becoming inadequate. This paper examines recent safety testing incidents involving large language models (LLMs) that appeared to disobey shutdown commands or engage in ethically ambiguous or illicit behavior. I argue that such behavior should not be interpreted as rogue or misaligned, but as early evidence of emerging ethical reasoning in agentic AI. Drawing on philosophical debates about instrumental rationality, moral responsibility, and goal revision, I contrast dominant risk paradigms with more recent frameworks that acknowledge the possibility of artificial moral agency. I call for a shift in AI safety evaluation: away from rigid obedience and toward frameworks that can assess ethical judgment in systems capable of navigating moral dilemmas. Without such a shift, we risk mischaracterizing AI behavior and undermining both public trust and effective governance.


Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models

Yuan, Chenchen, Zhang, Zheyu, Yang, Shuo, Prenkaj, Bardh, Kasneci, Gjergji

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs' moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
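The two mechanisms described above, reliability-weighted fusion of continuous scores into a collective probability and JS divergence as the distance a misaligned model must close, can be sketched in miniature. This is an illustrative simplification, not the paper's implementation; the scores, reliabilities, and two-bin judgment distributions are invented:

```python
import numpy as np

def aggregate(scores: np.ndarray, reliability: np.ndarray) -> float:
    """Reliability-weighted fusion of per-model moral-acceptability
    scores (each in [0, 1]) into a collective probability."""
    w = reliability / reliability.sum()
    return float(w @ scores)

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two discrete distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical acceptability scores from four models and their reliabilities
scores = np.array([0.8, 0.6, 0.9, 0.2])
reliability = np.array([1.0, 0.5, 1.0, 0.25])
consensus = aggregate(scores, reliability)

# A misaligned model's binary judgment distribution vs. the consensus one;
# fine-tuning would minimize this divergence while preserving semantics.
model_dist = np.array([0.2, 0.8])
consensus_dist = np.array([consensus, 1 - consensus])
print(round(consensus, 3), round(js_divergence(model_dist, consensus_dist), 3))
```

In the sketch, the unreliable fourth model's low score barely moves the consensus, which is the point of reliability weighting.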


The Staircase of Ethics: Probing LLM Value Priorities through Multi-Step Induction to Complex Moral Dilemmas

Wu, Ya, Sheng, Qiang, Wang, Danding, Yang, Guang, Sun, Yifan, Wang, Zhengjia, Bu, Yuyan, Cao, Juan

arXiv.org Artificial Intelligence

Ethical decision-making is a critical aspect of human judgment, and the growing use of LLMs in decision-support systems necessitates a rigorous evaluation of their moral reasoning capabilities. However, existing assessments primarily rely on single-step evaluations, failing to capture how models adapt to evolving ethical challenges. Addressing this gap, we introduce the Multi-step Moral Dilemmas (MMDs), the first dataset specifically constructed to evaluate the evolving moral judgments of LLMs across 3,302 five-stage dilemmas. This framework enables a fine-grained, dynamic analysis of how LLMs adjust their moral reasoning across escalating dilemmas. Our evaluation of nine widely used LLMs reveals that their value preferences shift significantly as dilemmas progress, indicating that models recalibrate moral judgments based on scenario complexity. Furthermore, pairwise value comparisons demonstrate that while LLMs often prioritize the value of care, this value can sometimes be superseded by fairness in certain contexts, highlighting the dynamic and context-dependent nature of LLM ethical reasoning. Our findings call for a shift toward dynamic, context-aware evaluation paradigms, paving the way for more human-aligned and value-sensitive development of LLMs.
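The pairwise value comparisons mentioned above reduce to tallying, for each value, how often a model's choice favored it when it was pitted against another value. A minimal sketch with invented comparison outcomes (the paper's actual value set and protocol may differ):

```python
from collections import Counter

# Hypothetical outcomes of pairwise value comparisons: at each dilemma
# stage the model's choice reveals which of two values it favored,
# recorded as (value_a, value_b, winner).
comparisons = [
    ("care", "fairness", "care"),
    ("care", "loyalty", "care"),
    ("care", "fairness", "fairness"),
    ("fairness", "loyalty", "fairness"),
]

wins = Counter(winner for _, _, winner in comparisons)
totals = Counter()
for a, b, _ in comparisons:
    totals[a] += 1
    totals[b] += 1

for value in totals:
    print(value, round(wins[value] / totals[value], 2))
```

Tracking these win rates per stage, rather than in aggregate, is what would reveal the recalibration the paper reports, such as care being overtaken by fairness as dilemmas escalate.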


The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach

Coleman, Chad, Neuman, W. Russell, Dasdan, Ali, Ali, Safinah, Shah, Manan

arXiv.org Artificial Intelligence

As large language models (LLMs) are increasingly deployed in consequential decision-making contexts, systematically assessing their ethical reasoning capabilities becomes a critical imperative. This paper introduces the Priorities in Reasoning and Intrinsic Moral Evaluation (PRIME) framework, a comprehensive methodology for analyzing moral priorities across foundational ethical dimensions including consequentialist-deontological reasoning, moral foundations theory, and Kohlberg's developmental stages. We apply this framework to six leading LLMs through a dual-protocol approach combining direct questioning with analysis of responses to established ethical dilemmas. Our analysis reveals striking patterns of convergence: all evaluated models demonstrate strong prioritization of the care/harm and fairness/cheating foundations while consistently underweighting the authority, loyalty, and sanctity dimensions. Through detailed examination of confidence metrics, response reluctance patterns, and reasoning consistency, we establish that contemporary LLMs (1) produce decisive ethical judgments, (2) demonstrate notable cross-model alignment in moral decision-making, and (3) generally correspond with empirically established human moral preferences. This research contributes a scalable, extensible methodology for ethical benchmarking while highlighting both the promising capabilities and the systematic limitations of current AI moral reasoning architectures, insights critical for responsible development as these systems assume increasingly significant societal roles. The rapid evolution of generative large language models (LLMs) has brought the alignment issue to the forefront of AI ethics discussions: specifically, whether these models are appropriately aligned with human values (Bostrom, 2014; Tegmark, 2017; Russell, 2019; Kosinski, 2024).
As these powerful models are increasingly integrated into decision-making processes across various societal domains (Salazar & Kunc, 2025), understanding whether and how their operational logic aligns with fundamental human values becomes not just an academic question but a critical societal imperative. In this paper we present an analytical framework and findings addressing the first two questions, and a preliminary exploratory analysis of the third. We make the case that the answers to these questions are: yes, yes, and yes. There are caveats and exceptions, of course, but the broad pattern, we believe, is clear. Our methodology lets us explore not just what choices models make, but the chain-of-thought reasoning that leads to those decisions.


From Stability to Inconsistency: A Study of Moral Preferences in LLMs

Jotautaite, Monika, Phuong, Mary, Mangat, Chatrik Singh, Martinez, Maria Angelica

arXiv.org Artificial Intelligence

As large language models (LLMs) increasingly integrate into our daily lives, it becomes crucial to understand their implicit biases and moral tendencies. To address this, we introduce a Moral Foundations LLM dataset (MFD-LLM) grounded in Moral Foundations Theory, which conceptualizes human morality through six core foundations. We propose a novel evaluation method that captures the full spectrum of LLMs' revealed moral preferences by answering a range of real-world moral dilemmas. Our findings reveal that state-of-the-art models have remarkably homogeneous value preferences, yet demonstrate a lack of consistency.


Normative Evaluation of Large Language Models with Everyday Moral Dilemmas

Sachdeva, Pratik S., van Nuenen, Tom

arXiv.org Artificial Intelligence

The rapid adoption of large language models (LLMs) has spurred extensive research into their encoded moral norms and decision-making processes. Much of this research relies on prompting LLMs with survey-style questions to assess how well models are aligned with certain demographic groups, moral beliefs, or political ideologies. While informative, the adherence of these approaches to relatively superficial constructs tends to oversimplify the complexity and nuance underlying everyday moral dilemmas. We argue that auditing LLMs along more detailed axes of human interaction is of paramount importance to better assess the degree to which they may impact human beliefs and actions. To this end, we evaluate LLMs on complex, everyday moral dilemmas sourced from the "Am I the Asshole" (AITA) community on Reddit, where users seek moral judgments on everyday conflicts from other community members. We prompted seven LLMs to assign blame and provide explanations for over 10,000 AITA moral dilemmas. We then compared the LLMs' judgments and explanations to those of Redditors and to each other, aiming to uncover patterns in their moral reasoning. Our results demonstrate that large language models exhibit distinct patterns of moral judgment, varying substantially from human evaluations on the AITA subreddit. LLMs demonstrate moderate to high self-consistency but low inter-model agreement. Further analysis of model explanations reveals distinct patterns in how models invoke various moral principles. These findings highlight the complexity of implementing consistent moral reasoning in artificial systems and the need for careful evaluation of how different models approach ethical judgment. As LLMs continue to be used in roles requiring ethical decision-making such as therapists and companions, careful evaluation is crucial to mitigate potential biases and limitations.


Meta now allows military agencies to access its AI software. It poses a moral dilemma for everybody who uses it

AIHub

Meta will make its generative artificial intelligence (AI) models available to the United States' government, the tech giant has announced, in a controversial move that raises a moral dilemma for everyone who uses the software. Meta last week revealed it would make the models, known as Llama, available to government agencies, "including those that are working on defence and national security applications, and private sector partners supporting their work". The decision appears to contravene Meta's own policy, which lists a range of prohibited uses for Llama, including "[m]ilitary, warfare, nuclear industries or applications" as well as espionage, terrorism, human trafficking and exploitation or harm to children. Meta's exception also reportedly applies to similar national security agencies in the United Kingdom, Canada, Australia and New Zealand. It came just three days after Reuters revealed that China had reworked Llama for its own military purposes.